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Abstract 


A package of 38 low-level subprograms for many of the basic operations 
of numericeLL linear suLgebra is presented. The package is intended to be 
used with FORTRAN. The operations in the package are dot products, 
elementary vector operations, Givens transformations, vector copy and swap, 
vector norms, vector scaling, and the indices of components of largest magnitudf . 

The subprograms suid a test driver are available in portable FORTRAN. 

Versions of the subprogreims are also provided in assembly language for the 
IBM 360 / 67 , the CDC 66 OO and CDC 76 OO, and the Uni vac II 08 . 
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BASIC LINEAR ALGEBRA SUBPROGRAMS 
FOR FORTRAN USAGE 


C. L. Lawson, Jet Propulsion Laboratory 
R. J. Hanson, Sandia Laboratories, Albuquerque 
D. R. Kincaid, University of Texas, Austin 
F. T. Krogh, Jet Propulsion Laboratory 


1. Introduction 

Tl'iis paper describes a package, called the BLAS, of thirty-eight 
FORTRAN-callable subprograms for basic operations of numerical linear 
algebra. This paper €uid the associated package of subprograms and testing 
programs are the result of a collaborative voluntary project of the ACM- 
SIGNUM committee on basic linear algebra subprograms. This project was 
carried out during the period 1973-1977. 

The initial version of the subprogram specifications appeared in Ref. 
[l]. Following distribution of Ref. [l] to persons active in the develop- 
ment of numerical linear algebra software, open meetings of the project 
were held at the Purdue Mathematical Software II Conference, May, 197^ j 
Ref. [ 2 ], and at the National Computer Conference, Anaheim, May, 1975. 
Extensive modifications of the specifications were made following the Purdue 
meeting which was attended by thirty people. A few additional changes 
resulted from the Anaheim meeting. Most of the further Fortran code changes 
resulted from an effort to improve the design and to make them more 
robust. 
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2 . ReaBons for Developing the Package 


Designers of computer programs involving linear algebraic operations 
have frequently chosen to implement certain low-level operations such as the 
dot product as separate subprograms. This may be observed both in many 
published codes and in codes written for specific applications at many computer 
insteJLlatlons. Following are some of the reasons for taking this approach: 

(1) It can serve as a conceptual aid in both the design Md coding 
stages of a programming effort to regard' an operation such as the 
dot product as a basic building block. This is consistent with 
the ideas of structured programming which encourage modularizing 
common code sequences. 

(2) It improves the self-docrunenting quality of code to identify an 
operation such as the dot product by a unique mnemonic name. 

(3) Since a slgniflcaint amoxint of the execution time in complicated 
linear algebraic programs may be spent in a few low-level 
operations, a reduction of execution time spent in these operations 
may be reflected in cost savings in the running of programs. 

Assembly language coded subprogreuns for these operations provide 
such savings on some computers. 

(U) The programming of some of these low-level operations involves 
algorithmic and implementation subtleties that are likely to be 
ignored in the typical applications programming environment. 

For example the subprograms provided for the modified Givens 
transformation incorporate control of the scaling terms which 
otherwise can drift monotonically toward underflow. 

If there could be general, eigreement on standard names and parameter 
lists for some of these basic operations it would add the additloncd. benefit 
of portability with efficiency on the assumption that the assembly language 
subprograms were generally available. Such standard subprograms would provide 
building blocks with vrtiich designers of portable subprograms for higher 
level linear algebraic operations such as solving linear algebraic equations, 




eigenvalue problems, etc., could achieve additional efficiency. The 
package of subprograms described in this paper is proposed to serve this 
purpose. 

3. Scope of the Package 

Specifications will be given for thirty-eight FORTRAN-callable subprograms 
covering the operations of dot product, vector plus a scalar times a vector, 
Givens transformation, modified Givens transformation, copy, swap. Euclidean 
norm, sum of magnitudes, multiplying a scalar times a vector, and locating 
an element of largest magnitude. Since we are thinking of these subprograms 
as being used in an ANS FORTRAN context we provide for the cases of single 
precision, double precision, and (single precision) complex data. 

In Table 1 a concise summary of the operations pro”^ided and the 
conventions adopted for naming the subprot'rams is given. Each type of 
operation is identified by a root name. The root name is prefixed by one 
or more of the letters I, S, D, C, or 0 to denote operations on integer, 
single precision, double precision, (single precision) complex, or extended 
precision data types, respectively. For subprograms involving a mixture of 
data types the type of the output quantity is indicated by the left-most 
prefix letter. Suffix letters are used on four of the dot product 
subprograms to distinguish variants of the basic operation. 

If one were to extend this package to include double precision complex 

type data (COMPLEX *l6 in IBM FORTRAN) we suggest that the prefix Z be used 

in the names of the new subprograms. For example, subprograms CZDOTC and 
CZDOTU for the dot product of (single precision) complex vectors, with 

double precision accumulation, have been written for the CDC 6600. These 

may be obtained directly from Kincaid. 
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Table 1 


Suxnmary of Functions and Names 
Of the Basic Linear Algebra Subprograms 


Function 

Prefix and Suffix of 

Name 



Root of 
Name 

Dot Product 

SDS- DS- DQ-I DQ-A 

C-U 

C-C 

D- 

• 

S- 

-DOT- 

Constant Times a Vector Plus a Vector 

1 

c- 

D- 

S- 

-AXPY 

Sot-up Givens 

Rotation 



D- 

S- 

-ROTG 

Apply Rotation 

» 



D- 

S- 

-ROT 

Set-up Modified Givens Rotation 



D- 

S- 

-ROTMG 

Apply Modified Rotation 



D- 

S- 

-ROTM 

Copy X into y 

' 


c- 

D- 

S- 

-COPY 

Swap X and y 



c- 

D- 

S- 

-SWAP 

2-Norm (Euclidean Length) 


SC- 

D- 

S- 

-NRM2 

Sum of Absolute Values* 


sc- 

D- 

S- 

-ASUM 

Constant Times a Vector 

CS- 

c- 

D- 

S- 

-SCAL 

Index of Element Having Max Absolute 
Value* 


IC- 

ID- 

IS- 

-AMAX 


*For conplex components subprograms compute 


|xj| + IYjI Inatead of (xj + yj) 


1/2 
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Section 5 lists all of the subprogram names and their parameter lists, 
and defines the operations performed by each subprograun. 

The criterion for including an operation in the package was that it 
should involve Just one level of looping ajid occur in the usual algorithms 
of numerical linear eOgebra such as Gaussian elimination or the various 
elimination methods using orthogonal transformations. 

This orientation affected the specifications of SCASTJM and ICAMAX 
particularly. Although SASUM and DASUM compute norms we assumed 
that the usage of either of these subprograms in numerical linear algebra 
software would be for the purpose of computing a vector norm that was less 
expensive to compute than the norm. Thus for the complex version, 
SCASUM, instead of specifying the norm which would be 

w =^|rRe(x^)f + [Im(x^)f| ^ 
i ' 

we specified the less expensive norm, 

w= ^^jlRe(x^)I + llm(x^)lj . 
i ' ’ 

Similarly, whereas ISAMAX and IDAMAX may be regarded as determining 
the nc~m of a vector, we do not regard this as the essential property to 
be carried over to the complex case. Thus ICAMAX is specified to find an 
index j such that 

lRe(Xj)l + jlm(Xj)| = max^^j lRe(x^)l + llm(x^)lj 
rather than finding an index J such that 

[Re(Xj)]^ + [im(Xj)]^ = max^j [Re(x^) + [lm(x^)]^| 

In both the computation of the norm and the Givens transformation 


a naive computation of the squares of the given data would restrict the 
exponent range of acceptable data. "Riis package avoids this restriction 
by making use of ideas described by Cody, Ref. [u], and Blue, Ref. [l2]. 
Additionally, in the case of the Givens transformations, em idea of Stewart, 

Ref. ri3]» permits the storage of all the transformations of a matrix 
decomposition in the memory space occupied by the elements zeroed by the 
trans formation . 

The modified Givens transformation is a relatively new innovation among 
numerlceU. linear algebra algorithms. Refs. [3], [4], and f5]. The significamt 
features are the reduction of the munber of multiplications, the elimination 
of squai'e root operations, and the capability of removing rows of data in 
least sq lores problems. The details of this algorithm as Implemented in 
this package are given in the Appendix. 

4 . Programming Conventions 

Vector arguments are permitted to have a storage spacing between 
elements. This spacing Is specified by an increment parameter. For example, 
suppose a vector x having components x^, 1 = 1,...,N is stored in a DOUBLE 
PRECISION array DX( ) with increment parameter INCX. If INCX s 0 then x^ is 
stored in DX(l+(i-l)*INCX) . Tf INCX < 0 then x^ is stored in DX(l+(N-i)*|lNCX| ). 
This method of indexing when INCX < 0 avoids negative indices in the array 
DX( ) and thus permits the subprograms to be written in FORTRAN. Only 
positive values of INCX are allowed for operations 26-38 that each have a 

single vector argument. 

It is intended that the loops in all subprograms process the elements 
of vector arguments in order of increasing vector component indices, l.e., in 
the order x^, i = 1,...,N. This implies processing in reverse storage order 
when INCX <0. If these subprograms are implemented on a computer having 



parallel processing capability, it is recommended that this order of 
processing be adhered to as nearly as is reasonable. 


5. Specification of the BLA Subprograms 

Type and dimension information for variables occurring in the subprt Tran. 

specifications are as follows: 

mx = max(l,N»|lNCXl) 
iny * max(l,N*| INCY I ) 

INTE3GER N , INCX , INCY , IMAX 

REAL SC(mx),SY(my),SA,SB,SC,SS 

REAL SDl,SD2,SBl,SB2,SPARAM(5),SW,(.vC,iO) 

DOUBLE PRECISION DX(mx; ,DY(my) ,LA,DB,DC,DS 
DOUBLE PRECISION DDl , DDi^ , DBl , DB2 , DPARAI^ ( 5 ) , DW 

COMPLEX CX(mx),CY(ray),CA,CW 

Type declarations for function neunes are as follows: 

INTEGER ISAMAX,IDAMAX,ICAMAX 
REAL SDOT , SDSDOT , SNRM2 , SCNRM2 , S ASUM , SC ASUM 

DOUBLE PRECISION DSDOT, DDOT,DODOTI DQD0TA,DNRM2,DASUM 
COMPLEX CDOTC CDOTU 


Dot Product Subprograms 

1. SW = SDOT(N, SX, INCX, SY, INCY) 

2. DW = DSDOT ( N,SX, INCX, SY, INCY) 


N 

i=l 

N 

V 

i=l 


Double precision accumulation is used within the subprogram DSDOT. 


3. SW = SDSDOT (N, SB, SX, INCX, SY, INCY) 


w : = 


N 

“ " Z ’‘i^i 

i=l 


Accumulation of the inner product and addition of b is in double 
precision. Conversion of tne final result to single precision is 
done the same as the intrinsic function SNGL( ). 
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N 

U. DW » DDOT(N,DX,INCX,DY,INCY) w 

1*1 

N 

5. DW » DQDOTl(N,DB,QC,DX,INCX,DY,INCY) w :« c := b + ^ 

i=l 

The input data, b, x, and y, are converted internally to extended 
precision. The result is stored in extended precision form in 
QC( ) and returned in double precision form as the value of the 
function DQDOTI. 

N 

6. EW - DODOTA(N,DB,OC,DX,INCX,DY,INCY) w := c := b + c ■*■ x^y 

i=l 

The input value of c in QC( ) is extended precision. The value c 
must have resulted from a previous execution of DQDOTI or DQDOTA 
since no other way is provided for defining an extended precision 
number. The computation is done in extended precision arithmetic 
and the result is stored in extended precision form in CXI ( ) and 
is returned in double precision form as the function value DQDOTA. 

N 

7. CW = CDOTC(N,CX,INCX,CY,INCY) w x^y^ 

i=l 

The suffix C on CDOTC indicates that the complex conjugates of the 
components x^ are used. 

N 

8. CW = CDOTU(N,CX,INCX,CY,INCY) w := ^ x^y^ 

i=l 


The suffix U on CDOTU indicates that the vector ccanponents are 
used luiconjugated. 

N 


In the preceding eight subprogrsuns the value of / will be set to zero if 
N.O, - 


Elementary Vector Operation y := ax + y 

9. CALL SAXPY(N,SA,SX,INCX,SY,INCY) 
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10. CALL DAXPY(N,D/.,DX,INCX,DY,INCY) 

11. CALL CAXPY(N,CA,CX,INCX,CY,INCY) 

If a * 0 or if N s 0 these subroutines return immediately. 


Construct Givens Plane Rotation 

12. CALL SROTGCSA.SB.SCjSS) 

13. CALL DR0TG(DA,DB,DC,DS) 

Given a euid b each of these subroutines computes 


and 


n = 


r » 


c = 


! sgn(a) 
sgn(b) 

2 ? 
a(a +b ) 

! a/r if 

1 if 


if 

if 

1/2 


r 

r 


|a| > |b| 
|bl 2 |a| 


/ 0 ■ 

- 0 


I b/r if r ^ 0 
(0 if r » 0 


The numbers c, s, and r then satisfy the matrix equation 



The introduction of a is not essential to the computation of a 
Givens rotation matrix but its use permits later stable reconstruction 
of c and s from just one stored number, an idea due to Stewart, 

Ref. [ 13 ]. For this purpose the subroutine also computes 

i 

! s if |aj>|b|orifa = b = 0 
1/c if |a| i jb| ^ 0 and c 0 
1 if |a| i |b| ^ 0 and c - 0 
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The Bubrouilnea return r overwriting a, ai.d z overwriting b, as well 
as let imlng c amd s. 

If the user later wishes to reconstruct c amd s from z It can 
be done as follows 


If z « 1 set c a 0 and s > 1 


If |z| < 1 set c 
If |z| > 1 set c 



a Plauie Rotation 


14. CALL SROT(N,SX,INCX,SY,INCY,SC,SS) 

15. CALL DROT(N,DX,INCX,DY,INCY,DC,DS) 
Each of these subroutines computes 



If N s 0 or If c » 1 and s = 0 the subroutines return immediately. 
Construct a Modified (ilvens Transformation 

16. CALL SROTMT; (SD1,SD2,SB1,SB2,SPARAM) 

17. CALL DR0TM[;(DD1,DD2,DB1,DB2,DPARAM) 

I 

T 

The input quantities d^^, d^, b^, and b^ define a 2- vector Ta^ja^] 
In partitioned form as 



The subroutine determines the modified Givens irotation matrix H, as 
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defined in Eqs» (A. 6) - (A. 7) of Appendix 1 thet trAnsfoms b^, and 
thus zero. A representation of this matrix is stored in the 

array SPARAM( ) or DPARAM( ) as follows. Locations in PARAM not 
listed are left unchanged. 


PARAM(l) = 1 

PARAM(l) = C 

PARAM(l) = -1 

Case of Eq. (A .7) 

Case of Eq. (A .6 ) 

Case of rescaling 

^12 “ ^ ^21 ~ 

^11 * ^22" ^ 

PARAM(2) = h^j^ 

PARAM (2) = h^^ 

PARAMO) = h.^^ 

PARAMO) = hgi 

PARAM(5) = hg2 

PARAM(U) = h^ 

PARAM(U) = 
PARAM(5) = hg2 


In addition PARAM(l) = -2 indicates H = I. 

The values of d^, d^, and are changed to represent the 
effect of the transformation. The quantity which would be 
zeroed by the transformation is left vinchanged in storage. 

The input value of d^ should be nonnegative, but il^ can be 
negative for the purpose of removing data from a least squares 
problem. Further details can be found in Appendix 1, 

Apply a Modified Givens Transformation 

18. CALL SROTM(N,SX,INCX,SY,INCY,SPARAM) 

19. CALL DROTM(N,DX,INCX,DY,INCY,DPARAM) 

Let H denote the modified Givens transformation defined by 
the pareuneter array SPARAM( ) or DPARAM( ). The subroutines compute 




fx." 

i 

:= H 

i 

-^i- 


-^i- 


If N 5 0 or if H is an identity matrix the subroutines return 
immediately. See Appendix 1 for further details. 
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Copy g Vector x to y 


X 


y : = 


20. CALL SCOPY(N,SX,INCX,SY,INCY) 

21. CALL DCOPY(N,DX,INCX,DY,INCY) 

22. CALL CCOPY(N,CX,INCX,CY,INCY) 

Return immediately if N i 0. 



Interchange Vectors x and y x ;=: y 

23. CALL SSWAP(N,SX,INCX,SY,INCY) 
2k. CALL DSWAP(N,DX,INCX,DY,INCY) 

25. CALL CSWAP(N,CX,INCX,CY,INCY) 

Return immediately if N s 0. 


Euclidean Length or Norm of a Vector 

26. SW=SNRM2(N,SX,INCX) 

27. DW=DNRM2(NiDX,INCX) 

28. SW=SCNRM2(N,CX,INCX) 

If N £ 0 the result is set to zero. 



Siun of Magnitudes of Vector Ccanponents 


29. 

SW= S ASUM ( N , SX , INCX ) 


30. 

DW=DASUM (N , DX , INCX ) 


31. 

SW= SC ASUM ( N , CX , INCX ) 

N 


The functions SASUM and DASUM compute 
SCASUM computes 

w :=^ l*i 

i=l 


w jlReal(xpI+ |lmag(Xj^)| 

i=l ' 


The function 
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These functions return immediately with the result set to zero if 


N s 0. 


Vector Scaling x := ax 

32. CALL SSCAL(N,SA,SX,INfcX) 

33. CALL DSCAL(N,DA,DX,INCX) 

3U. CALL CSCAL(N,CA,CX,INCX) 

35. CALL CSSCAL(N,SA,CX,INCX) 

Return immediately if N ^ 0. 


Find Largest Component of a Vector 


36. IMAX-ISAMAX(N,SX,INCX) 

37. IMAX=IDAMAX(N,DX,INCX) 

38 . IMAX=ICAMAX(N,CX,INCX) 


The functions IS.^MAX emd IDAMAX determine the smallest index i 
such that |x. 1 = max||x |:J = 1 ,...,n|. 

The function ICAMAX determines the smallest index i such that 

|x^| = max|lReal(Xj)l + llmag(x^)l;J = 1 ,...,n|. 

These functions set the result to zero and return immediately 
if N s 0. 


6. Implemen t at i on 

In addition to the FORTRAN versions, all of the subprograms except DQDOTI 
and DQDOTA cure also supplied in assembler language for the Univac II 08 , the 
IBM 300 / 67 , and the CDC 66 OO and 76 OO. The FORTRAN versions of DQDOTI and 
DQPOTA use part of Brent's multiple precision package. Ref. [lUl. Assembler 
language modules for these two subprogreuns are given only for the Univac IIO 8 . 

Only four of the assembly routines for the CDC 66 OO and 76 OO take 
advantage of the pipeline architecture of these machines. The four routines 
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SD0T( ), SAXPY( ), SR0T( ), and SROTM( ) are those typically used in the 
innermost loop of confutations. Some timing results are given in section 8, 
The subprograms SMCHCN and DMCHCN provide three machine dependent 
parameters that are used by the five routines SROTG, DROTG, SNRM2, DNRM2 and 
SCNRM2. 'Diese parameters are: SMALL s smallest positive floating point 

number, BIG s biggest positive floating point number, smd EPS * relative 
arithmetic precision. They are computed by use of subprograms SMCHAR and 
DMCHAR. These two subprograms were provided by W. J. Cody. An individual, 
computer installation may wish to remove Cody's routines and simply have the 
subprograms SMCHCN and DMCHCN return the appropriate constants. The test 
driver prints these numbers so that their values will be known by the user 
installation. 

7. Relation to the ANS FORTRAN Standard 

As of this writing (May, 1977) the present American National Standard 
FORTRAN is the 1966 standard. Ref. [6-8], that we will refer to as 1966 
FORTRAN. A draft proposed revision to this stauidard is currently identified 
as FORTRAN 77, Ref. [9], presently in the final editing phase. 

The cailling sequences of the BLA subprograms would require that the 
subprograms contain declarations of the form 

REAL SX(MAX0(1,N*IABS(INCX)) 

to precisely specify the array lengths. Neither I966 FORTRAN nor FORTRAN 77 
permits such a statement. A statement of the form 

REAL SX(1) 

is permitted by major FORTRAN compilers to c<^ver cases in which it is 
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inconvenient to specify' an exact dimension. This latter form is used in the 
BLA subprograms even though it does not conform tr> 1966 FORTRAN. FORTRAN 77 


allows the form 

REAL SX(*) 

for this situation. Thus the BLAS can be made to conform to FORTRAN 77 by 
changing "I's" to "»'s" in the subprogram array declarations. 

8, Testing 

A Master Test Package has been written in FORTRAN and is included with 
the submitted code. This package consists of a main program and a set of 
subprogreims containing built-in test data and correct answers. It executes 
a fixed set of test cases exercising all thirty-eight subprograms or 
optionally any selected subset of these. 

The test driver also calls subroutines SMCHCN and DMCHCN and prints 
the values of machine dependent values determined by these subroutines. 

We have attempted to design the test cases and the Master Test Program 
to be usable on a wide variety o;’ non-decimal machines having FORTRAN systems. 

The Master Test Package has succebsfully executed, testing the FORTRAN 
coded version of the Basic Linear Algebra Subprograms, on Univac 1108, 

IBM 360/67, Burroughs 67OO, CDC 6600, emd CDC 760O computers. These tests 
have also been run successfully testing the respective assembler packages 
on he Univac ]''08, IBM 360/67, CDC 6600 and CDC 76OO computers. 

The following method of comparing true and computed numbers is used 
in the Master Test Package. Let z denote a pre- stored true result and let z 
denote the corresponding computed result to be tested. The numbers n and 
are prestored censtants that will be discussed belo^. The test program computes 

d * fl(z-z) 

g = fl(lal + lfl(0 *d)|) 
h = |al 

T = fl(g-h) 


17 


where fl denotes machine floating point arithmetic of the current working 
precision, either single precision or double precision. It is further 
ass\imed that g ana h are truncated to working precision before being used in 
the computation of t. 

The test is passed If t * 0 and falls if t / 0 . Note that t will be zero 
if |d) is so small that adding |fl(0*d)| to |rt| gives a result that is not 

distinguished from jo| when truncated to working precision. 

.9 

For example, suppose a » 1 . , 0 » « 5 » d » 10 v then the mathematical 
value of rr + 0*d is 1.0000000005, but the single precision computed value 
of g on the Univac II08 will be 1 . resulting in t » 0 . Thus in this case 
d is small enough to pass the test. 

The n\jmber n is prestored along with the correct result z in the testing 
program. In general, <7 has different values for different test cases. 

The number 0 is a "tuning" factor which has been determined empirically 
to make the test perform correctly on a variety of machines. Note that the 
stringency of the test is relaxed by decreasing the value of 0. This has 
been used to desensitize the testing to the effects of differences in the 
treatment of trailing digits in tho floating noint arithmetic of different 
machines . 

There are four different values of 0 prestored in the main program, 

TBLA, of the testing package. These values are called SFAC, SDFAC, DFAC, 
emd rOFAC. These are used for testing operations which are respectively 
single precision, mixed single and double precision, double precision, and 
mixed double and extended precision. 

It is intended that the test package be useful to amyone who undertakes 
the iiq)lementatlon of an assembly-coded version of this package. In working 
on a new machine, one may find it necessary to reduce the values of one or 
more of the numbers SFAC, SDFAC, DFAC, or DOF AC to obtain correct test 


performance. The authors would appreciate hearing of any new assembly- 
coded versions of the packages and of any need to reduce the values of 
these tuning parameters. 

9. Selected Timing Results for the IBM 360/67. CPC 6600 and Univac 1108 
Timing of Dot Products and Elementary Vector Operations 

The most obvious implementation of the dot product and elementary 
vector operations for vectors with unit storage increments are in-line 
FORTRAN loops 1 and 2: 



In -Line 
FORTRAN for 
Dot Products 
Loop 1 


In-Line 
FORTRAN for 
Elementary 
Vector Operations 

Loop 2 


The BLAS replacements for these in-line FORTRAN loops, using the same 
variable names auid appropriate type statements, are 


BLAS 

Replacement for 
Loop 1 

BLAS 

Replacement for 
Loop 2 

The in front of the BLA subprogram names is due to the fact that both 
single and double precision versions are discussed here. 

These subprograms, coded in assembly language, were timed and compared 


CALL _AXPY(N,A,X,1,Y,1) 


W = _D0T(N,X,l,y,l) 
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with the time for the in-line loops. As was stated in section 2 , one 
reason for development of the package was to make highly efficient code 
possible. This goad has been achieved for the CDC 6600 but not for the 
IBM 360/67. The IBM 36o/6? FORTRAN H compiler, operating with Opt = 2, 
generates nearly perfect object code. 

In Tables 2 and 3 are some sample times for the three machines comparing 
Loops 1 and 2 auid their BLAS replacement. Interpretation of Tables 2 and 3> 
supported more f ully in Appendix 2, are as follows; 

• Because of linkage overhead, the BLA subprograms for the IBM 360/67 
are always less efficient than the in-line loops. For vectors of 
large enough length the linkage overhead is relatively negligible. 

• The dot product and elementary vector operation subprograms for the 
CDC 6600 are respectively 3.1 and 1.6 times more efficient than in-line 
code for vectors of large enough length. 

• For the CDC 66OO, dot products are considerably more efficient 
than elementary vector operations on vectors of the same length. 
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IBM 360/67 
Double Precision 


CDC 6600 
Single Precision 


In-Line 

FORTRAN 


Assembler 


In-Line 

FORTRAN 

FTN.0Dt*2 


Assembler 


0.C480 

0.0625 

0.0800 

0.1250 


Time, in seconds, for 1000 executions of in-line FORTRAN Loop 1 and calls to 
the _D0T( ) function. Times for 5 rvins were aversiged. 

Apply factors of 1.1 and 0.75 to IBM 360/67 times to get approximate respective 
times for nonequally spaced increments and single precision. No distinction for 
nonequal increments is necessary for the CDC 660O and Univac II08. 



Univac II08 
Single Precision 

In-Line 

FORTRAN 

Assembler 

0.0756 

0.0790 

0.1836 

C.I73O 

0.3598 

0.3182 

0.6986 

0.6162 


Table 2. D0T( ) function emd in-line Loop 1 timings 


Vector 

Length, 

N 


IBM 360/67 
Double Precision 

CDC 6600 
Single Precision 

Univac 1108 
Single Precision 


In-Line 

FORTRAN 

(H,0pt=2) 

Assembler 

In -Line 
FORTRAN 

(FTN,0pt=2) 

Assembler 

In-Line 

FORTRAN 

Assembler 

10 


0.0590 

0.2050 

0.0500 

0.0650 

0.0740 

0.0&Q() 

25 


0.3930 

0.4375 

0.1125 

0.1000 

0.1806 

0.189*3 

50 


0.7950 

0.8400 

0.2100 

0.1725 

0.3544 

0.3574 

100 



1.6000 

0.4200 

0.3000 

0.7292 

0.7170 


Time, in seconds, for 1000 executions of in-line FORTRAN Loop 2 and calls to the 
_AXPY( ) subprogram. Times for 5 nuis were averaged. 


Apply factor of 0.75 to get single precision IBM 360/67 times. Only vectors 
with unit increments were used in this timing. 


Table 3. AXPY( ) subprogram and in-line Loop 2 timings 
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riming of Standard and Modified Givens Methods 


Gentlemen's modification of the Givens transformation is discussed in 
the Appendix. This technique eliminates square roots auid two of the four multiply 
operations when forming the product of the resulting matrix by a 2-vector. 

The relative efficiency of Gentlemen's modification to the standard 
Givens transformation was compared. Both techniques were used to 
triangularize 2N by N matrices A = {a. .1 where 

J. J 

a^j = (iM-1)"^ 

In Table 4 there are sane sample times which resulted from the 
trlangularlzations using both methods. 

We are primarily interested in algorithm comparison here, so both 
methods were timed using their assembler versions to apply the matrix 
products. 

A conclusion is that in the context of triangularizing matrices, the 
modified Givens transformation method is ultimately more efficient in 
computer time by factors varying between 1.4 and 1.6. This is fully 
supported in Appendix 2. The comparison is most favorable on the IBM 
360/67 in double precision. 


N 

IBM 360/67 

CDC 6600 

Univac 

1108 


Double Precision 

Single Precision 

Single Precision 


Standard 

Modified 

Standard 

Modified 

Stemdard 

Modified 


Givens 

Givens 

Giver.'? 

Givens 

Givens 

Givens 

10 


0.0650 

0.0200 

0.0190 


0.0298 

25 

0.8789 

0.6250 

0.1719 

0.1445 

n 

0.3001 

Time, in seconds, for the triangularization of 2N by N matrices using steindard 

and modified Givens transformations. Times for 5 rxins were averaged. 



Table . Standard and modified Givens transformation in matrix 
triangularization 
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Appendix 1 


The Modified Givens Transformation 


The Givens transformation which eliminates z^, if / 0, is 


(A.l) 


GW 


■ t: :) 


w, . . .w„ 
1 N 




( 2 2r 

w^+z J . This requires ~UN floating point 
multiplications, 2N floating point additions and one square root. Gentlemeui, 
Ref. [ 3 ], has reported on a modification to the Givens transformation which 
reduces this operation count. Gentleman's idea is presented here in a 
slightly different form than found in his paper. 

Suppose that w in Eq. (A.l) is available in factored form 


(A. 2) 


W = D^X s 


it 0 


A 


*1* -^N 
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(A.U; 


(A.5) 


1 1 I. 

Substituting D^X for W amd refactoring GD^ as H‘^}i yields 

(A. 3) GW = GD^X = D ^HX * 

The right-hand side of Eq. (A. 3) yields an updated factored form for the 
matrix product GW. The crucial point is that the matrix H is selected so 
that two elements are exactly units. This eliminates 2N floating point 
muJtiplications when forming the matrix product HX. To preserve numerice ^ 
stability two cases aie considered: 

For |s| < |c| 



GD 


i. 


d^c 

d^8 

s 

d^c 

0 


-d^s 

1 

V 

Hcyvj 


_0 

d|c_ 



1 t (Vdj^) 
1 


f 


d^ 0 


0 "ai 


1 Vi/Vi 

•V*! 1 . 


« 5 


where t = s/c. 


For |c| i ls|j by similar manipulations, 

d,^ 0 


GD^ = 


H 2 


^ JL 




5 


-1 

where ^ = d^s, and ^ = d^s. This factorization can be done for any plane 
rotation matrix. 

Only the squares of the scale factors d^ are involved in the non-unit 
elements of the matrix H defined in Eq. (A.U) - (A. 15), which permits the 
Givens transformation Eq. (A.l) to be computed without square roots. Using 

n 2 

the identity c*" = (1+t ) and Eq. (A.U) allows the squares of the scale 
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factors to be updated; d^^ = d^(l+t^) , i . 1,2. Letting t = s/c in Eq. 

~ 2 -1 ^ 2 

(a. 5) we have d^^ = d^Cl+T ) and d^ = d^(l+r ) . For [c] > |s| or, 

2 2 

equivalently, |d^x^] > 


‘n 


12 


1 , = -y^/x^ 

Vl/ Vl • "22 ' ^ 


(A.6) 


u = 


di : = 


1 - i^2l\2 


d^/u 


dg/u 


Xi := 


XiU 


2 2 

For lc| s: |s| or, equivalently, i 


= V/Vl > *’21 
^12 " ^ * ^22 * 


= 1 


(A.7) 


^ ^ ^ \l^22 


V = d^/u 


di := d^/u 


dg := V 


x^ := y^u 


When using the modified Givens transformation in the context of 
"row accumulation," d^ > 0, i = 1, 2, the values of u in Eq. (A.6) - (A.7) 
will satisfy 1 s u ^ 2. Thus the squares d^, i = 1,2, decrease by as much 
as 1/2 at each updating step. If no rescaling action Is taken, these scale 
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factors would ultimately underflow. The details concerning rescaling 
are Implemented In the modified Givens subprograms. 

Since only d^^, the squares of the weights, appear In the formulas of 
Eq. (a. 6) - (A. 7) It Is possible to use the same formulas to remove a row 
from a least squares problem simply by setting d^ = -1. Remarks about 
this row removal method are found In Ref. [5l, Chapter 27. 

When the modified Givens transformation Is used In the context of the 
"row removal method" mentioned above, the values ofu In Eq. (A. 6) - (A. 7) 
satisfy 0 i u i 1. The case u = 0 Is eliminated by restricting d^^ 2 0. If 
d^ < 0, we define H as the zero matrix, the updated d^^ = 0, 1 = 1,2, and x^^ = 0. 
With this restriction, we have 0 < u ^ 2 in Eq. (A. 6) - (A. 7). Thus the 
change in the scale factors d^, 1 = 1,2, is unbounded at each step. Either 
underflow or overflow can occur if no rescaling is performed. 

The problem is rescaled by the modified Givens subprograms to keep 
within the conservative limits 


y~^ s |d^| s , 1 = 1,2, V = U096 . 


2 -1 
Note that when we rescad.e d^^ := d^v , we must rescale h^^j := , 

J = 1,2, €Uid rescale x^ := x^v”^. 
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Appendix 2 


Extended Timing Results for Some Operations 

In Section 9 selected timing results were presented for the IBM 360/67 
(double precision), the CDC 66OO (sir^le precision), and the Univac II08 
(single precision). Timing of dot products, elementary vector operations, 
amd Givens transformations was presented. This was done mainly for the 
purpose of illustrating the relative efficiency of in-line FORTRAN vs. 
assembler, eind the standard vs, the modified Givens transformation. 

Tables 5-11, given below, give more of this data than found in 
Section 9* The exception to this is the Univac II08 timing data which is 
totally presented in Section 9, so we did not reproduce it here. 


Vector 

Length, 

N 

IBM 360/67 

Single Precision 
Equal Storage Increments 

IBM 360/67 

Single Precision 
Nonequal Storage Increments 

10 

25 

50 

100 

In-Line 

FORTRAN 

(H,0pt=2) 

Assembler 

In-Line 

FORTRAN 

(H,0pt=2) 

Assembler 

0.1020 
• 0.2380 
0.4620 

0.9490 

0.1470 

0.2840 

0.5110 

0.9970 

0.1160 

0.2740 

0.5510 

1.1700 

0.166c 

0.3100 

0.5720 

l.lOX 

Time, in seconds, for 1000 executions of in-line FORTRAN Loop 1, 
Section 9, and calls to the SD0T( ) function. Times for 5 runs 
were averaged. 


Table 5* IBM 360/67 SDOT( ) function and single precision 
in-line Loop 1 timings 
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Vector 

Length, 

N 

IBM 360/67 
Double Precision 
Equal Storage Increments 

IBM 360/67 

Double Precision 
Nonequal Storage Increments 


In-Line 

FORTRAN 

(H,0pt=2) 

Assembler 

In-Line 

FORTRAN 

(H,0pt»2) 

Assembler 

10 

0.1430 

0.1910 

0.1590 

0.1980 

25 

0.3430 

0.3840 

0.3840 

0.4160 

50 

0.6770 

0.7250 

0.7800 

C.8180 

100 

0,3900 

1.3900 

1.5400 

1.5700 


Time, in seconds, for 1000 executions of in-line FORTRAN Loop 1, 
Section 9» aJid calls to the DD0T( ) function. Times for 5 runs were 
averaged. 


Table 6. IBM 36c/67 DD0T( ) function and double precision in-line 
Loop 1 timings 


Vector 

Length, 

N 

CDC 6600 

Single Precision Equal or 
Nonequal Storage Increments 

CDC 7600 

Single Precision Equal or 
Nonequal Storage Increments 


In-Line 

FORTRAN 

(FTN,0pt=2) 

Assembler 

In-Line 

FORTRAN 

(FTN,0pt=2) 

Assembler 

10 

0.0358 

0.0480 

0.0042 

0.0092 

25 

0.0756 

0.0638 

0.0100 

0.0110 

50 

0.1420 

0.0808 

0.0210 

0.0162 

100 

0.2750 

0.1230 

O.C4l4 

0.0254 


Time, in seconds, for 1000 executions of in-line FORTRAN Loop 1, 
Section 9, and calls to the SD0T( ) function. Times for 5 runs were 
averaged . 


Table 7, CDC 6600 and CDC 7600 SD0T( ) function and single precision 
in-line Loop 1 timings 



Vector 

Length, 

N 

IBM 360/67 
Single Precision 
Equal Storage Increments 

IBM 360/67 

Double Precision 
Equal Store/’e Increments 

10 

25 

50 

100 

In-Line 

FORTRAN 

(H,0pt=2) 

Assembler 

In-Line 

FORTRAN Assembler 

(H,0pt=2) 

0.1190 

0.2880 

o.*;t6o 

1.1700 

0.1700 

0.3610 

0.6300 

1.1900 

0.1590 O.20UO 

0.3930 0.4390 

■0.7960 0.8420 

1.5500 1.59CO 


Time, in seconds, for 1000 executions of in-line FORTRAN Loop 2, 
Section 9, and calls to the SAXPY( ) and DAXPY( ) subprograms. 
Times for 5 runs were averaged. 


Table 8. IBM 360/67 SAXPY( ) and DAXPY( ) subprogram, and 

single and double precision in-line Loop 2 timings 


Vector 

Length, 

N 

CDC 6600 

Single Precision 
Equal Storage Increments j 

CDC 7600 
Single Precision 
Equal Storage Increments 

10 

25 

50 

100 

L 

In-Line 

FORTRAN 

(FTN,0pt=2) 

Assembler 

In-Line 

FORTRAN Assembler 

(FTN,0pt=2) 

0.0502 

0.1120 

0.2130 

0.4240 

0.0640 

0.1020 

0.1710 

0.3020 

O.CO6O 0.0114 

0.0150 0.0162 
0.0290 0.0252 
0.0582 0.0420 

Time, in seconds, for 1000 executions of in-line FORTRAN Loop 2, 
Section 9 k nnd ceills to the SAXPY( ) subprogram. Times for 5 
nu’.s were averaged. 


Table 9. CDC 66OO and CDC 7600 SA>7°Y( ) subprogram and single 
precision in-line Loop 2 timings 





N 

IBM 360/67 
Single Precision 

IBM 360/67 

Double Precision 


Standard 

Givens 

Modified 

Givens 

Standard Modified 

Givens Givens 

10 

25 

0.0580 

0.5850 

0.048U 

0.4635 

0.0800 0.0650 

0.8789 0.6250 


Tl-ne, in seconds, for the triejigularizat^on of 2N 
\>'j 1.' matrices using standard and modified Givens 
tr.'irisforr. ationr . rime.i fcr 5 :-unti were.* averaged. 


lat»le 10. IBM "jSofSj sirigle /uio double 

precision ctaridexd and modified 
Gi '■eii'5 transfoi’Tiat.ion 'iiming for 
macrix • riangularl zation 


N 

CDC 6600 
Single Precision 

CDC 76CO 

Single Precision 


Standard 

Givens 

Modified 

Givens 

Standard Modified 

Givens Givens 

10 

25 

50 

100 

0.0200 

0.1719 

0.9600 

5.8500 

0.0190 

0.1445 

0.7550 

4.3500 

0.0036 0.0035 
0.0279 0.0250 
0.1430 0.1265 
0.8200 0.7100 


Time, in seconds, for the triangvLLarization of 2N 
by N matrices using standard and modified Givens 
transformations. Times for 5 runs were averaged. 


Table 11, CDC 6600 and CDC 7600 single precision 
standard and modified Givens trans> 
formation timing for matrix 
triangular! zatlon 




Appendix 3 


Sample Usage of the BLAS in FORTRAN Rrogreunming 


Our experience indicates that using the BLAS actually enhances the 
readability and reliability of codes in which they are utilized. Efficiency 
does not appreciably degrade with their usage, as indicated in Section 9> and 
for large-scale problems certain of the BLAS will markedly out-perform in-line 
FORTRAN code. 

These remarks are based on useige of the BLAS in developing new software 
for the Sandia Math. Library, developing new ordinary differential equation 
solving codes, conversations with members of the LINPACK working group 
participating in the project of Ref. [lO], and experience with applications 
programmers at Sandia Laboratories ajid Jet Propulsion Laboratory. 

Typical usage of the BLAS in FORTRAN programs is now illustrated with 
nine examples using the single precision versions of the operations. 

Some rules, based upon the FORTRAN language, that a programmer may find 
useful to recall are these: 

• Suppose a two-dimensional FORTRAN array A(MDA,NDA) is used to hold an M 

by N matrix A = If A(I,J) := a , then the Ith row vector of A 

and the Jth column vector of A respectively start at A(l,l) and A(l,j). 

The relations MDA ^ M and NBA 2 N must hold for the matrix to fit into 
this array. 

• The storage increment between elements of row vectors of A, e.g. A(l,l) 
and A(1,2), is MDA, the first dimensioning parameter of the array A(*,*), 

• The storage Increment between elements of column vectors of A, e.g. A(l,l) 
and A(2,1), is 1. This is due to the fact that the FORTRAN language 
stores A(*,*) by columns: 

A(1,1),A(2,1),...,A(MDA,1),A(1,2) A(MDA,2) , . . . , A(MDA,NDA) 

The value of NDA is used by the FORTRAN compiler only to allocate MDA*NDA 
words of memory in the program. 
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Example 1 


Given M by K and K by N natrices A and B, conqjute the M by N product 
matrix C « AB. 

The coding technique for this computation is based on the fact that 

each element c of C is the dot product of row I of A auid column J of B. 
XJ 


DIMENSION A(20,20),B(15,10),C(20,15) 

C 

MDA«20 

MDB=15 

MDC=20 

C 

M=10 

K=15 

N=10 

C 

C FOPM THE DOT PRODUCT OF ROW I OF A WITH COLUMN J OF B. EACH OF THESE 

C VECTORS IS OF LENGTH K. THE VALUE OF MDA IS THE STORAGE INCREMENT 

C BETWEEN ELEMENTS OF RCW VECTORS OF A. 

C 

DO 10 1=1, M 
DO 10 J=1,N 

10 C(I,J)=SD0T(K,A(I,1),MDA,B(1,J),1) 


Example 2 

Solve an N by N upper triangular nonsingular system of algebraic 

equations. Ax = b . The method used is based on the observation that if 

we compute the ccxnponent Iben we have a new problem in N - 1 

unknowns, still upper triangular, with the new right-side vector 

T 

^^l“*^lN^****’^N-l"%-l N^^ • example the solution vector, x, 

overwrites the vector b in the array B(*). 

DO 20 11=1, N 
I=N+1-II 
B(I)=B(I)/A(I,I) 

CALL SAXPY (I-1,-B(I),A(1,I),1,B,1) 
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Example 3 


Scale the columns (each assumed to be nonzero) of an M by N matrix C 
so that each column has unit length. 


DO 30 J=1,N 

T=1.E0/SNRM2(M,C(1,J),1) 
30 CALL SSCAL(M,T,C(1,J),1) 


acample U 

Row-equilibrate an N by N matrix A. (Divide each non-zero row vector 
of A by the entry in that row of maximum maignitude). Here MDA is the 
first dimensioning parameter of the array A(*,*). 

DO UO 1=1, N 

JMAX=ISAMAX(N,A(I,1),MDA) 

T=A(I,JMAX) 

IF(T.EQ.O.EO) GO TO 40 

CALL SSCAL(N,1.E0/T,A(I,1),MDA) 
ho CONTINUE 

When using ISAMAX( ) to choose row pivots in Gaussian elimination, 
for example, the major loop contains a statement of the form 

IflAX=ISAMAX(N-J+l, A(J, J) ,l)+J-l 

At that point IMAX corresponds to the row that will be interchauiged 
with row J. Thus the offset value J - 1 must be added to the computed 
value of ISAMAX( ) to get the actvial row number to interchange. 

Example ^ 

Set an N by N matrix A to the N by N identity matrix. Then set B = A, 
Notice that a storage increment value of 0 for the first vector 
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argument of SCOPf( ) is used. This "broadcasts” the values of O.BO and 
l.EO into the second vector argument. 

Here MDA is the first dimensioning parameter of the array A(*,*). 

DO 50 J»1,N 

50 CALL SC0Py(N,0.E0,0,A(l,J),l) 

CALL SC0PY(N,1.E0,0,A,MDA+1) 

DO 60 J*1,N 

60 CALL SCOPy(N,A(l,J),l,B(l,J),.l) 


Ebcample 6 

Interchange or swap the columns jf an M by N matrix C. ^Tie column 
to be interchanged with column J is in a type INTEGER array IP(*), and 
has the value IP(J). 


DO 70 J=1,N 
L=IP(J) 

IF(J.NE.L) CALL SSWAP(M,C(1,J) ,l,C(l,L) ,1) 
70 CONTINUE 


Example 7 

a) Extract the first number amd "pop" a list of N single precision 

nvnnbers: x^ := x^, x^^ ;= 1 = 1,...,N-1, N N-1 

b) "Push-down" a list of N siigle precision numbers and Insert a 

new number x^ at the top of the list: := i = N,...,l; 

*1 *0’ N ^ 

T 

For these illustrations the vector x = (x^,...,x^) is in the FORTRAN 
array X(*). 

Notice tlie usage of the negative increments (-1) for the push-down 
example of b). This causes the assignment 

X(N+1)=X(N) ,X(N)=X(N-1) , . . . ,X(2)=X(1) 
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to be Implemented In this order 


a) Extract and "pop" 

N=N-1 

X&=X(1) 

CALL SC0PY(N,X(2),1,X(1),1) 

b) "Push-down" and insert 

CALL SC0Py(N,X(l),-l,X(2),-l) 

N=N+1 

X(1)=X0 


Example 8 

In this example we want to transpose an N by N matrix A in-place, 
( in - situ ) . Here MDA is the first dimensioning parameter of the array 

DO 80 J=1,N 

00 CALL SSWAP(N-J,A(J,J+1),MDA,A(J+1,J),1) 


Exaag)le 9 

In this more complicated example we swap in-place ( in - situ ) the 
components of the vector 


so they become 


(Xi,...,x^,Xj^^^,...,Xn) 


• • • »^|^) 


making repeated vae of the "Pop" or "Push-down" operations. 


NMK^^N-K 

IF(.N0T.(K.GT.0.AND.NMK.GT.0)) GO TO 120 

IF(.N0T.(K.LT.NMK)) GO TO 100 
DO 90 1=1, K 
l'=X(l) 

CALL SC0PY(N-1,X(2),1,X(1),1) 
X(N)=T 
GO TO 120 


90 
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100 

CONTINUE 


DO no I-l.NMK 


T-X(N) 


CALL SC0PY(N-1,X(1),-1,X(2),-1) 

110 

X(l)-T 

120 

CONTINUE 
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